Change from thread idx based predicate to warp idx based#5821
Conversation
Review updated until commit 122ac43

Description

Relevant files

Enhancement
PR Reviewer Guide
Here are some key observations to aid the review process:
🧪 PR contains tests
⚡ Recommended focus areas for review
Error Handling
Test failures
- (Medium, 5) NVFuser internal assert: NumComputeWarps not constant in TmaPersistent tests

  | Test Name | GB200 | Source |
  |---|---|---|
  | TmaPersistentTestF.KernelReuse | ❌ | Link |
  | TmaPersistentTestP.TmaInnerPersistentRmsNorm/__bfloat_2048_1024 | ❌ | Link |
  | TmaPersistentTestP.TmaInnerPersistentRmsNorm/__bfloat_2048_2048 | ❌ | Link |
  | TmaPersistentTestP.TmaInnerPersistentRmsNorm/__bfloat_2048_4096 | ❌ | Link |
  | TmaPersistentTestP.TmaInnerPersistentRmsNorm/__bfloat_2048_8192 | ❌ | Link |

- (Medium, 2) Thunder nvFuser nanoGPT autograd scalar mismatch in thunder.tests.test_networks

  | Test Name | GB200 | H100 | Source |
  |---|---|---|---|
  | thunder.tests.test_networks.test_nanogpt_complete_autograd_nvfuser_cuda_thunder.dtypes.float32 | ❌ | ❌ | |

- (Medium, 2) nvFuser warp_specialize codegen assertion failure in tests.python.direct.test_tutorial

  | Test Name | GB200 | H100 | Source |
  |---|---|---|---|
  | tests.python.direct.test_tutorial.test_warp_specialized_circular_buffering_pointwise | ❌ | ❌ | |
Greptile Summary

This PR refactors warp-specialized kernel predicate generation to use uniform warp IDs instead of thread indices, providing significant performance improvements through compiler optimization.

Key Changes:

Performance Impact:

Architecture:

Confidence Score: 4/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
participant Compiler as NVIDIA Fuser Compiler
participant Alloc as AllocationInserter
participant PDM as ParallelDimensionMap
participant PC as PredicateCompute
participant CB as CircularBuffer
Compiler->>Alloc: Start allocation pass
activate Alloc
Alloc->>PDM: Check canUseWarpIdBasedPredicate()
activate PDM
PDM-->>Alloc: Returns true if consecutive warp IDs possible
deactivate PDM
alt Warp ID based predicates available
Alloc->>Alloc: Compute uniform warp ID<br/>(tid = tidx + tidy*bdimx + tidz*bdimx*bdimy)<br/>(warp_id = tid / 32)
Alloc->>Compiler: Store uniform_warp_id in GpuLower
Compiler->>PC: Generate predicates
activate PC
PC->>PC: selectFirstWarpElectSyncPredicate()
Note over PC: Use warp_id == 0
PC->>PC: createElectSyncPredicateAsync()
Note over PC: Use warp_id == num_compute_warps
PC->>PC: createMultipleExpressionElectSync()
Note over PC: Route to async/compute warp predicates
deactivate PC
Compiler->>CB: getAsyncWarpPredicate()
activate CB
CB->>PDM: Get getNumComputeWarps()
Note over CB: Use warp_id >= num_compute_warps
CB-->>Compiler: Return warp ID based predicate
deactivate CB
else No consecutive warp IDs (fallback)
Note over Alloc: Skip warp ID computation
Alloc-->>Compiler: uniform_warp_id = nullptr
Compiler->>PC: Generate predicates (fallback)
activate PC
PC->>PC: Use thread index based predicates<br/>(threadIdx.x < 32 for first warp)<br/>(linear_index >= computed_threshold for async)
deactivate PC
Compiler->>CB: getAsyncWarpPredicate (fallback)
activate CB
CB->>PDM: Use getWarpSpecializationPaddedVal()
Note over CB: Use parallel_index >= (raw - padding)
CB-->>Compiler: Return thread index based predicate
deactivate CB
end
Compiler->>Compiler: Generate optimized PTX/SASS<br/>with reduced WARPSYNC/VOTEU.ALL instructions
deactivate Alloc
```
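To make the diagram concrete, below is a minimal CUDA sketch of what the two predicate paths could look like in generated device code. This is illustrative only, not the PR's actual codegen output; the kernel name and the `kNumComputeWarps` value are placeholders.

```cuda
// Illustrative sketch only; names and constants are placeholders, not the
// PR's generated code.
__global__ void warpSpecializedSketch() {
  constexpr unsigned kNumComputeWarps = 8;  // assumed example value

  // Warp-ID-based path: linearize the thread index and derive a warp ID that
  // is identical for every thread of a warp. (The review discussion below adds
  // a __shfl_sync broadcast so PTXAS can actually prove this uniformity.)
  const unsigned tid = threadIdx.x + threadIdx.y * blockDim.x +
      threadIdx.z * blockDim.x * blockDim.y;
  const unsigned warp_id = tid / 32;

  if (warp_id >= kNumComputeWarps) {
    // Async warps (e.g. TMA loads). The thread-index fallback would instead
    // compare a linear index against a computed threshold.
  } else if (warp_id == 0) {
    // First compute warp (elect-sync style selection). The thread-index
    // fallback would use threadIdx.x < 32 here.
  }
}
```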
!test
csrc/predicate_compute.cpp
Outdated
```cpp
// Val* zero = IrBuilder::create<Val>(0L, PrimDataType::UInt64);
// Val* warp_size = IrBuilder::create<Val>(32L, PrimDataType::UInt64);

// const ParallelDimensionMap& pdim_map =
//     GpuLower::current()->info().parallelDimensionMap();
// Val* async_warp_thread_index = pdim_map.getLinearThreadIndexAsync();
// Val* warp_id =
//     SimplifyingIrBuilder::divExpr(async_warp_thread_index, warp_size);
// // TODO Only select first warp now
// Val* select_warp = SimplifyingIrBuilder::eqExpr(warp_id, zero);

// // Use elect-sync if available
// if (pdim_map.canUseElectSyncInAsyncWarp()) {
//   return SimplifyingIrBuilder::logicalAndExpr(
//       select_warp, createElectSyncExpr());
// }

// // Warp Specialized ParallelType is ThreadIdx.x and it contains less than 32
// // threads, so manually select first thread in warp.
// Val* thread_id =
//     SimplifyingIrBuilder::modExpr(async_warp_thread_index, warp_size);
// Val* select_thread = SimplifyingIrBuilder::eqExpr(thread_id, zero);
// return SimplifyingIrBuilder::logicalAndExpr(select_warp, select_thread);
```
style: Remove commented-out legacy code
0acf1bd to ded2693
!test

!test

!test

!test
```cpp
// Compute warp_id = tid / 32
Val* warp_size = IrBuilder::create<Val>(32L, DataType::Index);
Val* warp_id = SimplifyingIrBuilder::divExpr(tid, warp_size);
```
It seems like there is nothing guaranteeing this will be warp-uniform, since the compiler cannot know the block size, so unless TIDz and TIDy are both >1 it won't know that tid is the linear thread ID. So do we need to do a warp broadcast? See #2323.
yes, we need something like:

```cuda
// __shfl_sync helps PTXAS prove that every thread in the warp has the same
// uniform warp id.
__device__ __forceinline__ uint32_t getUniformWarpId() {
  const unsigned int tid = threadIdx.x + threadIdx.y * blockDim.x +
      threadIdx.z * blockDim.x * blockDim.y;
  const unsigned int warp_id = tid / 32;
  return __shfl_sync(0xFFFFFFFF, warp_id, 0);
}
```
This PR is not ready yet.
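For illustration only (not code from the PR), a minimal usage sketch of the `getUniformWarpId()` helper above; the kernel name and `kNumComputeWarps` are placeholders.

```cuda
// Illustrative usage of the getUniformWarpId() helper defined above;
// kNumComputeWarps is a placeholder, not a value taken from the PR.
__global__ void warpSpecializeUsageSketch() {
  constexpr unsigned kNumComputeWarps = 8;
  const unsigned warp_id = getUniformWarpId();  // warp-uniform by construction

  if (warp_id >= kNumComputeWarps) {
    // async warps: e.g. issue TMA loads
  } else {
    // compute warps
  }
}
```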
Test:

```bash
NVFUSER_DUMP=ptx,sass,scheduler_params,launch_param,cuda_to_file ./test_nvfuser --gtest_filter=*ThunderRMSNormBwd*bfloat*16384* 2>&1 | tee new.log
```

SASS Code Comparison: old.log vs new3.log
Summary
Use `UniformWarpId()` instead of `threadIdx.x/y/z` for warp specialization predicates.

Key Differences
1. Warp ID Calculation
OLD (old.log):
- `threadIdx.x` directly (stored in R5)

NEW (new3.log):

- `R0 >> 5` (equivalent to `tid / 32`)
- `SHFL.IDX` instruction to ensure all threads in a warp have the same value

2. Predicate Comparison
OLD (old.log):
- `threadIdx.x >= 256` (0x100)

NEW (new3.log):

- `UniformWarpId >= 8` (0x8), i.e. the same 256-thread boundary expressed in warps (256 / 32 = 8)

3. Code Size
- `.L_x_42` vs `.L_x_44`

4. Register Usage
Both versions use similar register allocation patterns, but:
5. Thread Index Reads
OLD (old.log):
NEW (new3.log):
- `tid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x`

Benefits of the New Approach
- The `SHFL.IDX` instruction explicitly proves to PTXAS that all threads in a warp have the same warp ID value

Performance Impact
🚀 Instruction Count Improvements (MAJOR WINS!)
⚡ WARPSYNC Instruction Reduction
Where WARPSYNC was eliminated:
OLD version had extra WARPSYNC at:
NEW version simplified to:
💡 Why This Happens
The `SHFL.IDX` instruction explicitly proves to PTXAS that all threads in a warp have the same warp ID. This allows the compiler to:

Eliminate VOTEU.ALL instructions (-8 instructions, -89%): When PTXAS knows the warp ID is uniform, it doesn't need expensive warp-level ballot operations to check uniformity
- Old pattern: `ISETP` → `VOTEU.ALL` → check result

Eliminate redundant WARPSYNC (-2 instructions, -40%): When PTXAS knows values are uniform, it doesn't need as many WARPSYNC barriers before ELECT operations
Simplify predicates (-5 ISETP instructions): Fewer comparison operations needed when uniformity is established early
Optimize control flow: Better branch prediction and divergence handling
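For intuition, here is a small self-contained sketch (an assumed example, not nvFuser's generated code) of the two source-level patterns this analysis contrasts; the 256-thread / 8-warp threshold comes from the comparison above, and the comments restate the claims made here about what PTXAS emits.

```cuda
// Assumed, self-contained example contrasting the two predicate styles
// discussed above; not taken from nvFuser's generated code.
__global__ void oldStylePredicate(float* out) {
  // Per the analysis, PTXAS cannot prove this condition is identical for all
  // threads of a warp (the block shape is unknown at compile time), so it
  // guards warp-collective code with VOTEU.ALL / extra WARPSYNC.
  if (threadIdx.x >= 256) {
    out[threadIdx.x] = 1.0f;  // stand-in for async-warp work
  }
}

__global__ void newStylePredicate(float* out) {
  const unsigned tid = threadIdx.x + threadIdx.y * blockDim.x +
      threadIdx.z * blockDim.x * blockDim.y;
  // Broadcasting lane 0's warp ID makes the value provably identical across
  // the warp, which is what lets PTXAS drop the uniformity checks.
  const unsigned warp_id = __shfl_sync(0xFFFFFFFF, tid / 32, 0);
  if (warp_id >= 8) {
    out[tid] = 1.0f;  // stand-in for async-warp work
  }
}
```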
📊 Overall Performance Impact
Cost vs Benefit Analysis:
Added overhead:
- `SHF.R.U32.HI` + `SHFL.IDX`
Net Performance Gain:
🎯 Summary
This is a clear win! Trading 2-3 cycles for 100-200+ cycles saved, with the biggest gains from: